In general, a handful of things need to be put together (that is,
defined and executed) as the basis for an overall disaster recovery
process or plan. The following list identifies where you need to
start:
1. Create a disaster recovery execution tasks/run book. This should include all steps to take to recover from a disaster and cover all system components that need to be recovered.
2. Arrange for or procure a server/site to recover to. This should be a configuration that can house what is needed to get you back online.
3. Guarantee that a complete database backup/recovery mechanism is in place (including offsite/alternate site archive and retrieval of databases).
4. Guarantee that an application backup/recovery mechanism is in place (for example, COM+ applications, .NET applications, web services, other application components, and so on).
5. Make sure you can completely re-create and resynchronize your security (Microsoft Active Directory, domain accounts, SQL Server logins/passwords, and so on). We call this “security resynchronization readiness.”
6. Make sure you can completely configure and open up network/communication lines. This also includes ensuring that routers are configured properly, IP addresses are made available, and so on.
7. Train your support personnel on all elements of recovery. You can never know enough ways to recover a system. And it seems that a system never recovers the same way twice.
8. Plan and execute an annual or bi-annual disaster recovery simulation. The one or two days that you do this will pay you back a hundred times over if a disaster actually occurs. And, remember, disasters come in many flavors.
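One way to keep the run book from the first item honest is to hold it as structured data, so that you can check mechanically that every system component has at least one recovery step. The following Python sketch illustrates the idea only. The component names, the step descriptions, and the RunBookStep, ordered_steps, and uncovered_components names are all hypothetical and would be replaced by your own inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunBookStep:
    """One recovery task in the run book (hypothetical structure)."""
    order: int       # position in the recovery sequence
    component: str   # system component this step recovers
    action: str      # what has to be done
    owner: str       # who performs the step


# Hypothetical inventory of components that must be recoverable.
REQUIRED_COMPONENTS = {"database", "application", "security", "network"}

# Hypothetical run book; note that no step covers "application" yet.
RUN_BOOK = [
    RunBookStep(2, "security", "Re-create domain accounts and SQL Server logins", "DBA"),
    RunBookStep(1, "network", "Open communication lines and assign IP addresses", "network team"),
    RunBookStep(3, "database", "Restore databases from the offsite archive", "DBA"),
]


def ordered_steps(steps):
    """Return the steps sorted into execution order."""
    return sorted(steps, key=lambda step: step.order)


def uncovered_components(steps, required):
    """Return the required components that no step recovers."""
    covered = {step.component for step in steps}
    return sorted(required - covered)


if __name__ == "__main__":
    for step in ordered_steps(RUN_BOOK):
        print(f"{step.order}. [{step.component}] {step.action} ({step.owner})")
    print("Components without a recovery step:",
          uncovered_components(RUN_BOOK, REQUIRED_COMPONENTS))
```

Running the coverage check whenever the run book changes is a cheap way to notice that a component (here, the application tier) has been left without any recovery steps.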
Many organizations have moved to keeping hot alternate sites
available through stretch clustering or log shipping. Some of these
advanced, highly redundant solutions can be costly.
The Focus of Disaster Recovery
If you create some very
solid, time-tested mechanisms for re-creating your SQL Server
environment, they will serve you well when you need them most.
Following are the things to focus on for disaster recovery:
Always generate scripts for as much of your work as possible (anything
created using a wizard, SSMS, and so on). These scripts will save your
hide. They should include the following:
Complete replication buildup/breakdown scripts
Complete database creation scripts (DB, tables, indexes, views, and so on)
Complete SQL login, database user ID, and password scripts (including roles and other grants)
Linked/remote server setup (linked servers, remote logins)
Log shipping setup (source, target, and monitor servers)
Any custom SQL Agent tasks
Backup/restore scripts
Potentially other scripts, depending on what you have built on SQL Server
Make
sure you document all aspects of SQL database maintenance plans being
used. This includes frequencies, alerts, email addresses being notified
when errors occur, backup file/device locations, and so on.
Document all hardware/software configurations used:
Leverage sqldiag.exe for this (as described in the next section).
Record which accounts are used to start the SQL Agent service for an
instance and the MS Distributed Transaction Coordinator (MS DTC)
service. This step is especially important if you’re using distributed
transactions and data replication.
The SQL Server implementation characteristics that we most like to script and record for a SQL Server instance are as follows:
select @@SERVERNAME— Provides the full network name of the SQL Server and instance.
select @@SERVICENAME— Provides the Registry key under which Microsoft SQL Server is running.
select @@VERSION— Provides the date, version, and processor type for the current installation of Microsoft SQL Server.
exec sp_helpserver—
Provides the server name; the server’s network name; the server’s
replication status; and the server’s identification number, collation
name, and time-out values for connecting to, or queries against, linked
servers.
exec sp_helplogins— Provides information about logins and the associated users in each database.
exec sp_linkedservers— Returns the list of linked servers defined in the local server.
exec sp_helplinkedsrvlogin—
Provides information about login mappings defined against a specific
linked server used for distributed queries and remote stored procedures.
exec sp_server_info— Returns a list of attribute names and matching values for Microsoft SQL Server.
exec sp_helpdb dbnamexyz—
Provides information about a specified database or all databases. This
includes the database allocation names, sizes, and locations.
exec sp_spaceused, run in the context of a database (dbnamexyz here),
provides the actual space usage of both data and indexes for that
database:
use dbnamexyz
go
exec sp_spaceused
go
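exec sp_configure– Gets the current SQL Server configuration values. Turn on “show advanced options” first so that the full list is returned:
USE master
GO
-- Expose the advanced options so that sp_configure lists every setting
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
-- List all configuration options with their configured and running values
EXEC sp_configure
GO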
name                                minimum      maximum     config_value  run_value
----------------------------------- ------------ ----------- ------------- ----------
access check cache bucket count     0            65536       0             0
access check cache quota            0            2147483647  0             0
Ad Hoc Distributed Queries          0            1           0             0
affinity I/O mask                   -2147483648  2147483647  0             0
affinity mask                       -2147483648  2147483647  0             0
affinity64 I/O mask                 -2147483648  2147483647  0             0
affinity64 mask                     -2147483648  2147483647  0             0
Agent XPs                           0            1           1             1
allow updates                       0            1           0             0
awe enabled                         0            1           0             0
backup compression default          0            1           0             0
blocked process threshold (s)       0            86400       0             0
c2 audit mode                       0            1           0             0
clr enabled                         0            1           0             0
common criteria compliance enabled  0            1           0             0
cost threshold for parallelism      0            32767       5             5
cross db ownership chaining         0            1           0             0
cursor threshold                    -1           2147483647  -1            -1
Database Mail XPs                   0            1           0             0
default full-text language          0            2147483647  1033          1033
default language                    0            9999        0             0
default trace enabled               0            1           1             1
disallow results from triggers      0            1           0             0
EKM provider enabled                0            1           0             0
filestream access level             0            2           2             2
fill factor (%)                     0            100         0             0
ft crawl bandwidth (max)            0            32767       100           100
ft crawl bandwidth (min)            0            32767       0             0
ft notify bandwidth (max)           0            32767       100           100
ft notify bandwidth (min)           0            32767       0             0
index create memory (KB)            704          2147483647  0             0
in-doubt xact resolution            0            2           0             0
lightweight pooling                 0            1           0             0
locks                               5000         2147483647  0             0
max degree of parallelism           0            64          0             0
max full-text crawl range           0            256         4             4
max server memory (MB)              16           2147483647  2147483647    2147483647
max text repl size (B)              -1           2147483647  65536         65536
max worker threads                  128          32767       0             0
media retention                     0            365         0             0
min memory per query (KB)           512          2147483647  1024          1024
min server memory (MB)              0            2147483647  0             0
nested triggers                     0            1           1             1
network packet size (B)             512          32767       4096          4096
Ole Automation Procedures           0            1           0             0
open objects                        0            2147483647  0             0
optimize for ad hoc workloads       0            1           0             0
PH timeout (s)                      1            3600        60            60
precompute rank                     0            1           0             0
priority boost                      0            1           0             0
query governor cost limit           0            2147483647  0             0
query wait (s)                      -1           2147483647  -1            -1
recovery interval (min)             0            32767       0             0
remote access                       0            1           1             1
remote admin connections            0            1           0             0
remote login timeout (s)            0            2147483647  20            20
remote proc trans                   0            1           0             0
remote query timeout (s)            0            2147483647  600           600
Replication XPs                     0            1           0             0
scan for startup procs              0            1           0             0
server trigger recursion            0            1           1             1
set working set size                0            1           0             0
show advanced options               0            1           1             1
SMO and DMO XPs                     0            1           1             1
SQL Mail XPs                        0            1           0             0
transform noise words               0            1           0             0
two digit year cutoff               1753         9999        2049          2049
user connections                    0            32767       0             0
user options                        0            32767       0             0
xp_cmdshell                         0            1           0             0
Disk configurations, sizes, and currently available space (use standard
OS directory listing commands on all disk volumes being used).
Capture the sa login password and OS administrator password so that anything can be accessed and anything can be installed (or re-installed).
Document all contact information for your vendors:
Microsoft support services contacts (do you use “Premier Product Support Services”?)
Storage vendor contact info
Hardware vendor contact info
Offsite storage contact info (to get your archived copy fast)
Network/telecom contact info
Your CTO, CIO, and other senior management contact info
CD-ROMs available for everything (SQL Server, service packs, operating system, utilities, and so on)
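Recorded configuration is most useful when it can be compared against what is running later. As a rough sketch, assuming you have saved the text output of sp_configure to a file at two points in time, the following Python code parses each listing and reports the options whose run_value changed. The function names and the sample values are hypothetical, and the parser assumes each row ends with the four numeric columns (minimum, maximum, config_value, run_value) on a single line.

```python
def parse_sp_configure(text):
    """Parse saved sp_configure text output.

    Returns {option name: (minimum, maximum, config_value, run_value)}.
    Option names contain spaces, so the last four tokens of a row are
    taken as the numeric columns and everything before them as the name.
    """
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # blank line or fragment
        try:
            values = tuple(int(token) for token in parts[-4:])
        except ValueError:
            continue  # header or separator row
        settings[" ".join(parts[:-4])] = values
    return settings


def changed_settings(before, after):
    """Return {option: (old run_value, new run_value)} for options that differ."""
    changes = {}
    for name in sorted(before.keys() & after.keys()):
        old_run, new_run = before[name][3], after[name][3]
        if old_run != new_run:
            changes[name] = (old_run, new_run)
    return changes


# Hypothetical snapshots, trimmed to three options for illustration.
BASELINE = """\
name                        minimum  maximum  config_value  run_value
--------------------------- -------- -------- ------------- ---------
backup compression default  0        1        0             0
max degree of parallelism   0        64       0             0
xp_cmdshell                 0        1        0             0
"""

LATEST = """\
name                        minimum  maximum  config_value  run_value
--------------------------- -------- -------- ------------- ---------
backup compression default  0        1        0             0
max degree of parallelism   0        64       4             4
xp_cmdshell                 0        1        1             1
"""

if __name__ == "__main__":
    drift = changed_settings(parse_sp_configure(BASELINE),
                             parse_sp_configure(LATEST))
    for option, (old, new) in drift.items():
        print(f"{option}: run_value {old} -> {new}")
```

Any option reported here is a setting that the recovery site would need to match, so the list is a natural addition to the documentation described above.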
sqldiag.exe
One good way to get a complete environmental picture is to run the sqldiag.exe
program provided with SQL Server 2008 on your production box (the one
you would have to re-create on an alternate site if a disaster occurred).
It is located in the Binn directory where all SQL Server executables reside (C:\Program Files\Microsoft SQL Server\100\Tools\Binn).
It shows how the server is configured, all hardware and software
components (and their versions), memory sizes, CPU types, operating
system version and build information, paging file information,
environment variables, and so on. If you run this program on your
production server periodically, it serves as good environment
documentation to supplement your disaster recovery plan. This utility
is also used to capture and diagnose SQL Server-wide issues, and it
prompts you when it is ready for you to re-create the issue you want
to collect diagnostic information about. For the purposes of this
chapter, when prompted for the SQLDIAG collection, you can just
terminate that portion by pressing Ctrl+C. Figure 1 shows the expected execution DOS windows and system information dialog window.
To run this utility, you open a DOS command prompt and change directory to the SQL Server Binn directory. Then, at the command prompt, you run sqldiag.exe:
C:\Program Files\Microsoft SQL Server\100\Tools\Binn> sqldiag.exe
The results are written into several text files within the SQLDIAG
subdirectory. Each file contains different types of data about the
physical machine (server) that SQL Server is running on and about
each SQL Server instance. The machine (server) information is
stored in a file named XYZ_MSINFO32.TXT, where XYZ
is the machine name. It is a verbose snapshot of everything that
relates to SQL Server in one way or another: the hardware
configuration, drivers, and the other metadata and configuration
information tightly coupled to the SQL Server instance. The following
is an example of what it contains:
System Information report written at: 09/11/09 22:13:16
System Name: DBARCH-LT2
[System Summary]
Item                             Value
OS Name                          Microsoft® Windows Vista™ Home Premium
Version                          6.0.6001 Service Pack 1 Build 6001
Other OS Description             Not Available
OS Manufacturer                  Microsoft Corporation
System Name                      DBARCH-LT2
System Manufacturer              Hewlett-Packard
System Model                     HP G60 Notebook PC
System Type                      x64-based PC
Processor                        Pentium(R) Dual-Core CPU T4300 @ 2.10GHz, 2100 Mhz, 2 Core(s), 2 Logical Processor(s)
BIOS Version/Date                Hewlett-Packard F.3C, 6/23/2009
SMBIOS Version                   2.4
Windows Directory                C:\Windows
System Directory                 C:\Windows\system32
Boot Device                      \Device\HarddiskVolume1
Locale                           United States
Hardware Abstraction Layer       Version = "6.0.6001.18000"
User Name                        DBARCH-LT2\DBARCH
Time Zone                        Pacific Daylight Time
Installed Physical Memory (RAM)  Not Available
Total Physical Memory            3.90 GB
Available Physical Memory        1.87 GB
Total Virtual Memory             8.04 GB
Available Virtual Memory         5.63 GB
Page File Space                  4.20 GB
Page File                        C:\pagefile.sys
and so on.
A separate file is generated for each SQL Server instance you have installed on a server. These files are named XYZ_ABC_sp_sqldiag_Shutdown.OUT, where XYZ is the machine name and ABC
is the SQL Server instance name. Each file contains most of the
internal SQL Server information about how the instance is configured,
including a snapshot of the SQL Server log for the instance as it runs
on this machine. The following example shows this critical information
from the DBARCH-LT2_SQL08DE01_sp_sqldiag_Shutdown.OUT file:
2009-09-07 23:50:21.540 Server Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 (X64)
    Mar 29 2009 10:11:52
    Copyright (c) 1988-2008 Microsoft Corporation
    Developer Edition (64-bit) on Windows NT 6.0 <X64> (Build 6001: Service Pack 1)
2009-09-07 23:50:21.560 Server (c) 2005 Microsoft Corporation.
2009-09-07 23:50:21.560 Server All rights reserved.
2009-09-07 23:50:21.560 Server Server process ID is 1884.
2009-09-07 23:50:21.560 Server Logging SQL Server messages in file
    'C:\Program Files\Microsoft SQL Server\MSSQL10.SQL08DE01\MSSQL\Log\ERRORLOG'.
2009-09-07 23:50:21.570 Server Registry startup parameters:
    -d C:\Program Files\Microsoft SQL Server\MSSQL10.SQL08DE01\MSSQL\DATA\master.mdf
    -e C:\Program Files\Microsoft SQL Server\MSSQL10.SQL08DE01\MSSQL\Log\ERRORLOG
    -l C:\Program Files\Microsoft SQL Server\MSSQL10.SQL08DE01\MSSQL\DATA\mastlog.ldf
2009-09-07 23:50:21.610 Server Detected 2 CPUs.
    This is an informational message; no user action is required.
2009-09-07 23:50:21.910 Server Using dynamic lock allocation.
    Initial allocation of 2500 Lock blocks and 5000 Lock Owner blocks per node.
    This is an informational message only. No user action is required.
2009-09-07 23:50:23.050 spid7s FILESTREAM: effective level = 3,
    configured level = 3, file system access share name = 'SQL08DE01'.
2009-09-07 23:50:23.820 spid7s Server name is 'DBARCH-LT2\SQL08DE01'.
    This is an informational message only. No user action is required.
From this output, you can ascertain the complete SQL Server instance
information as it was running on the primary site. It is excellent
documentation for your SQL Server implementation. We suggest that you
run this utility regularly and compare the output with that of prior
executions so that you know exactly what must be in place in case of
disaster.
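Comparing one run against the previous one is easy to automate. The following Python sketch uses the standard difflib module to diff two saved sqldiag text files while skipping lines that differ on every run. It is only an illustration: the list of volatile line prefixes, the function names, and the sample values are assumptions, and you would adjust the prefixes to suit the files you actually keep.

```python
import difflib

# Lines that change on every run and would otherwise drown out real changes.
# This list is an assumption; extend it for your own output files.
VOLATILE_PREFIXES = (
    "System Information report written at",
    "Available Physical Memory",
    "Available Virtual Memory",
)


def stable_lines(text, ignore_prefixes=VOLATILE_PREFIXES):
    """Return the lines of a snapshot, minus those with a volatile prefix."""
    prefixes = tuple(ignore_prefixes)
    return [line.rstrip() for line in text.splitlines()
            if not line.lstrip().startswith(prefixes)]


def diff_snapshots(previous, current, ignore_prefixes=VOLATILE_PREFIXES):
    """Return unified-diff lines between two snapshot texts (empty if equal)."""
    return list(difflib.unified_diff(
        stable_lines(previous, ignore_prefixes),
        stable_lines(current, ignore_prefixes),
        fromfile="previous run",
        tofile="current run",
        lineterm="",
    ))


if __name__ == "__main__":
    # Hypothetical excerpts from two runs; in practice, read the saved files.
    previous = (
        "System Information report written at: 09/11/09 22:13:16\n"
        "Total Physical Memory 3.90 GB\n"
        "Available Physical Memory 1.87 GB\n"
        "Page File C:\\pagefile.sys\n"
    )
    current = (
        "System Information report written at: 10/11/09 08:00:00\n"
        "Total Physical Memory 7.90 GB\n"
        "Available Physical Memory 5.10 GB\n"
        "Page File C:\\pagefile.sys\n"
    )
    for line in diff_snapshots(previous, current):
        print(line)
```

An empty result means nothing of substance changed since the last run. Anything that does appear (here, a memory upgrade) is a change the recovery site has to account for.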
Planning and Executing a Disaster Recovery
The process of planning
and executing a complete disaster recovery is serious business, and
many companies around the globe set aside a few days a year to perform
this exact task. Here’s what it involves:
Simulate a disaster.
Record all actions taken.
Time all events from start to finish. Sometimes this means someone is standing around with a stopwatch.
Hold a postmortem following the DR simulation.
Many
companies tie the results of a DR simulation to the IT group’s salaries
(their raise percentage). This is more than enough motivation for IT to
get this drill right and to perform well.
Correcting any failures or issues that occur is critical. The next time might not be a simulation.
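The recording and timing steps do not have to rely on a stopwatch. As a minimal sketch, assuming the drill is driven (or at least logged) from a script, the following Python code records each action, how long it took, and whether it succeeded, which gives you the raw material for the postmortem. The DrillLog class and the step names are hypothetical.

```python
import time
from contextlib import contextmanager


class DrillLog:
    """Records the name, duration, and outcome of each DR drill step."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock   # injectable so the log can be tested
        self.entries = []     # list of (step name, elapsed seconds, succeeded)

    @contextmanager
    def step(self, name):
        """Time the enclosed block; a raised exception marks the step failed."""
        start = self._clock()
        succeeded = False
        try:
            yield
            succeeded = True
        finally:
            self.entries.append((name, self._clock() - start, succeeded))

    def total_seconds(self):
        """Total elapsed time across all recorded steps."""
        return sum(elapsed for _, elapsed, _ in self.entries)

    def failures(self):
        """Names of the steps that did not complete."""
        return [name for name, _, ok in self.entries if not ok]


if __name__ == "__main__":
    log = DrillLog()
    with log.step("restore databases from offsite archive"):
        time.sleep(0.01)  # stand-in for the real recovery work
    try:
        with log.step("re-create linked servers"):
            raise RuntimeError("linked server unreachable")  # simulated failure
    except RuntimeError:
        pass  # the failure is already recorded in the log
    for name, elapsed, ok in log.entries:
        print(f"{name}: {elapsed:.2f}s, {'ok' if ok else 'FAILED'}")
    print(f"Total: {log.total_seconds():.2f}s; to correct: {log.failures()}")
```

The list of failed steps is exactly what the postmortem needs to work through before the next simulation.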